Abstract


Classification is an important tool with many useful applications. Among the many 
classification methods, Fisher’s Linear Discriminant Analysis (LDA) is a traditional 
model-based approach which makes use of the covariance information. However, in the 
high-dimensional, low-sample size setting, LDA cannot be directly deployed because 
the sample covariance is not invertible. While there are modern methods designed to 
deal with high-dimensional data, they may not fully use the covariance information as 
LDA does. Hence in some situations, it is still desirable to use a model-based method 
such as LDA for classification. This article exploits the potential of LDA in more complicated
data settings. In many real applications, it is costly to manually place labels
on observations; hence often only a small portion of labeled data is available
while a large number of observations are left without a label. It is a great challenge
to obtain good classification performance through the labeled data alone, especially
when the dimension is greater than the size of the labeled data. In order to overcome
this issue, we propose a semi-supervised sparse LDA classifier to take advantage
of the seemingly useless unlabeled data. These data provide additional information which
helps to boost the classification performance in some situations. A direct estimation
method is used to reconstruct LDA and achieve the sparsity; meanwhile we employ
the difference-convex algorithm to handle the non-convex loss function associated with
the unlabeled data. Theoretical properties of the proposed classifier are studied. Our
simulated examples help to understand when and how the information extracted from 
the unlabeled data can be useful. A real data example further illustrates the usefulness 
of the proposed method. 

KEY WORDS: Bayes Decision Rule; Classification; Clustering; Difference-convex Algorithm;
High Dimension Low Sample Size; Semi-supervised Learning; Sparsity.

1 Introduction


Classification is an important tool in modern statistical analysis. In a classification problem,
the training data set $\{(x_i, y_i),\ i = 1, \dots, n\}$ is obtained from an unknown distribution,
where $x_i \in \mathbb{R}^d$ is the vector of observed covariates and $y_i$ is the class label for the $i$th observation. In
this article, we focus on binary classification, that is, a classification problem with only two
possible classes, $y_i \in \{+1, -1\}$. The goal of classification is to obtain a classification rule
$\phi(\cdot)$ based on the training data, such that for any new observation with only the covariates
$x$ available, its class label $y$ can be accurately predicted as $\phi(x)$.

There are many classification methods in the literature. For an overall introduction, see
[1]. One popular group of methods is the linear classifiers, due to their simplicity and
interpretability. For a linear classifier, the classification rule is defined as $\mathrm{sign}\{f(x)\}$, where
$f(x) = \omega' x + b$, with $\omega \in \mathbb{R}^d$ and $b \in \mathbb{R}$, is a discriminant function linear in $x$, obtained from
the training data. Some examples of linear classifiers include Fisher’s Linear Discriminant
Analysis (LDA) [2], Logistic Regression [3], Support Vector Machine (SVM) [4, 5], $\psi$-learning
[6], Distance Weighted Discrimination (DWD) [7], Large-margin Unified Machine [8], and
hybrids of SVM and DWD [9, 10]. For a binary linear classifier, a classification boundary
(also known as a separating hyperplane) is induced by $\{x : \omega' x + b = 0\}$, which divides the
sample space into two halves, one for each class.

Despite the rapid development of the newer methods above, the LDA method
is still widely used among practitioners. LDA is a traditional model-based classification
approach which makes use of the covariance information under a Gaussian assumption.
Because LDA is simple to implement and straightforward to interpret, it is one of the most
popular statistical methods for classification. Lee and Wang [11] compared the performance
of LDA with some machine learning approaches and concluded that LDA would work better
in cases where the Gaussian assumption is roughly true. That being said, LDA has several
drawbacks which make it undesirable in more complicated data settings (see the next two
subsections). The goal of this article is to enrich the potential of LDA in these settings.

1.1 Working with High-Dimensional, Low-Sample Size Data 

The High-Dimensional, Low-Sample Size (HDLSS) data setting is very challenging for statistical
learning and it appears in many applied fields such as gene expression micro-array
analysis, facial recognition, medical image analysis and text mining. In the HDLSS context,
classical multivariate statistical methods often fail to give a meaningful analysis [7].
For example, there exists an interesting phenomenon called ‘data piling’ for discriminant
analysis [12]. ‘Data piling’ means that when training data points are projected onto a low-dimensional
discriminant subspace, many of the projections are identical. This phenomenon
is caused by the fact that the corresponding discriminant subspace, a one-dimensional coefficient
vector in the case of binary classification, is driven by very particular artifacts of
the realization of the training data. This makes ‘data piling’ an undesirable property for
discrimination since the classifier performs worse for out-of-sample test data. Other challenges
in the HDLSS setting include the collinearity among predictors and the error aggregation
over dimensions, among others (see [13]). Fan and Li [14] gave a comprehensive overview of
statistical challenges with high dimensionality in diverse disciplines, such as computational
biology, health studies and financial engineering. In particular, they demonstrated that for
many statistical problems, the model parameters can be estimated as if the best model was
known in advance, as long as the dimensionality was not excessively high. These challenges
have motivated the development of new methods and theory in the HDLSS setting.

In recent years, many efforts have been made to make classification methods more suitable
for the HDLSS data. The DWD method [7] is claimed to enjoy a better discriminant subspace
than SVM. In addition, a great number of research articles are dedicated to improving
traditional methods so that they have sparse discriminant coefficient vectors. An underlying
assumption is that there are only a few variables which truly drive the difference between
classes. Hence, a variety of methods use regularization approaches to encourage a sparse
representation of the coefficient vector $\omega$, which in general is obtained from optimizations
of the form

$$\operatorname{argmin}_{\omega, b} \sum_{i=1}^{n} L(\omega, b, x_i, y_i) + p(\omega),$$

where $L(\cdot)$ is a loss function to minimize the misclassification and $p(\cdot)$ is a penalty term to
control the model complexity. Common choices of the penalty function include the $\ell_1$ norm
penalty, the SCAD penalty [15] and the minimax concave penalty (MCP) [16]. Examples of
sparse regularization methods include the lasso [17] and the elastic net [18] in regression, and
the $\ell_1$ norm SVM [19], the sup-norm SVM [20] and the direct sparse discriminant analysis
(DSDA) [21] in classification. See [22] for a review.

1.2 Working with Partially Labeled Data 

In many real problems, it is difficult or expensive to obtain the class label information; on the
other hand, it may be relatively cheap to obtain the covariate information quickly for many
observations. Hence, it is often the case that there are many observations without labels
(unlabeled data) and a few observations with labels (labeled data). For instance, in spam
detection, there are a large number of unidentified emails, but only a small set of identified
emails are used to train a filter to flag incoming spam. In facial recognition, the training
data may include a few manually identified faces with scars and an enormous number of unidentified faces.
In these situations, one typical research question is how to use both unlabeled and labeled
data to enhance the prediction accuracy.

In the big data era, the dimension of the data is often greater than the sample size of 
the labeled data, though not necessarily greater than that of the unlabeled data. In this
case, we have an HDLSS setting if we consider the labeled data only. On the other hand,
the unlabeled data may contain useful information to overcome the difficulty caused by the 
high dimensionality. Semi-supervised learning is a class of machine learning techniques that 
make use of both labeled and unlabeled data for modeling. This article is written with the 
belief that unlabeled data, in some cases, can indeed produce considerable improvement in 
learning accuracy. Note that the semi-supervised learning problem is different from the more 
traditional missing data problem in statistics: in the current article, the size of unlabeled 
data is much greater than that of the labeled data. 

Many semi-supervised approaches have been proposed in different settings, including the
co-training method [23], the EM algorithm [24], the bootstrap method [25], the Bayesian
network [26], the Gaussian random field and harmonic function [27], the transductive SVM
(TSVM) [28, 29, 30], the large-margin based methods [31, 32] and the graph-based regularization
methods [33, 34, 35, 36]. Many of these methods rely on the clustering assumption
[37], which assumes the closeness between the classification and the grouping (clustering)
boundaries.

In this article, we aim to improve a model-based classifier, namely the classical LDA
method, so that it can be used to classify partially labeled data in a high-dimensional space.
This is achieved by marrying LDA with a machine learning technique to incorporate the unlabeled
data. The end product of the article is a Semi-Supervised Sparse Linear Discriminant
Analysis (S$^3$LDA) method.

The rest of the article is organized as follows. Section 2 starts with an introduction to the
existing development for sparse LDA, followed by our proposed S$^3$LDA method. Section 3
presents the implementation of our method, including the tuning parameter selection issue. 
Some theoretical results are presented in Section 4, followed by numerical studies in Section 
5. Section 6 contains some concluding remarks. The Appendix is devoted to technical proofs. 

2 Semi-supervised LDA in HDLSS Setting 

Consider a binary classification problem. Let $X = (X_1, \dots, X_d)' \in \mathbb{R}^d$ be the covariates,
$Y \in \{+1, -1\}$ be the class label, and $n_+$ and $n_-$ be the sizes of the positive and negative
classes. The LDA method assumes that $X \mid Y = y \sim N(\mu_y, \Sigma)$, $P(Y = +1) = \pi_1$, and
$P(Y = -1) = \pi_2 = 1 - \pi_1$. Given $\Sigma$, $\mu_+$ and $\mu_-$, the Bayes classification rule is given by
$\phi^{\mathrm{Bayes}}(x) = \mathrm{sign}\{x'\omega^{\mathrm{Bayes}} + b^{\mathrm{Bayes}}\}$, where the coefficient vector $\omega^{\mathrm{Bayes}} = \Sigma^{-1}(\mu_+ - \mu_-)$ and
the intercept term $b^{\mathrm{Bayes}} = -(\mu_+ + \mu_-)'\Sigma^{-1}(\mu_+ - \mu_-)/2 + \log(\pi_1/\pi_2)$. Hence the Bayes rule
classifies an observation $x$ to the positive class if and only if
$\{x - (\mu_+ + \mu_-)/2\}'\Sigma^{-1}(\mu_+ - \mu_-) + \log(\pi_1/\pi_2) > 0$. In practice, since the true distributions are unknown, we use the
pooled sample covariance $\hat{\Sigma}$, the sample mean vectors $\hat{\mu}_+$ and $\hat{\mu}_-$, and $n_+/n_-$ to estimate
$\Sigma$, $\mu_+$, $\mu_-$, and $\pi_1/\pi_2$, respectively. The resulting classification rule is the LDA classifier.
For many data sets in modern applications, the dimension $d$ can be much greater than
the sample size $n$. In such cases, LDA cannot be directly used because the sample covariance
$\hat{\Sigma}$ is, with probability one, not invertible. Moreover, the sample mean difference $(\hat{\mu}_+ - \hat{\mu}_-)$
may deviate from the true population mean difference in each dimension, which could
lead to overfitting of the classifier due to error aggregated over the $d$ dimensions.
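
The plug-in rule above can be illustrated with a minimal NumPy sketch. The function names `fit_lda` and `predict_lda`, the toy data and the seed are our own illustrative assumptions, not part of the original article. The last lines also illustrate why the rule breaks down when $d > n$: the sample covariance then has rank at most $n - 1$.

```python
import numpy as np

def fit_lda(X, y):
    """Plug-in LDA for labels in {+1, -1}; returns (omega, b). Hypothetical helper."""
    Xp, Xm = X[y == 1], X[y == -1]
    n_p, n_m = len(Xp), len(Xm)
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
    # Pooled within-class sample covariance (invertible only if n is large relative to d).
    S = ((Xp - mu_p).T @ (Xp - mu_p) + (Xm - mu_m).T @ (Xm - mu_m)) / (n_p + n_m - 2)
    omega = np.linalg.solve(S, mu_p - mu_m)
    # Intercept: -(mu_+ + mu_-)' S^{-1} (mu_+ - mu_-) / 2 + log(n_+ / n_-).
    b = -0.5 * (mu_p + mu_m) @ omega + np.log(n_p / n_m)
    return omega, b

def predict_lda(X, omega, b):
    """Classify to +1 when the discriminant function is positive, else -1."""
    return np.where(X @ omega + b > 0, 1, -1)

rng = np.random.default_rng(0)
n, d = 400, 5
y = rng.choice([-1, 1], size=n)
mu = np.zeros(d)
mu[0] = 1.5                                   # classes differ in the first coordinate only
X = rng.standard_normal((n, d)) + np.outer(y, mu)
omega, b = fit_lda(X, y)
acc = np.mean(predict_lda(X, omega, b) == y)  # training accuracy on the toy data

# HDLSS illustration: with n = 10 < d = 50 the sample covariance is rank deficient.
X_hd = rng.standard_normal((10, 50))
rank_hd = np.linalg.matrix_rank(np.cov(X_hd, rowvar=False))
```

In the second part, `np.linalg.solve` would fail or be numerically meaningless, which is the motivation for the sparse variants reviewed next.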

2.1 Sparse LDA 

It is a common practice to overhaul a traditional statistical method by introducing sparsity
for high-dimensional data. The pioneers of sparse LDA include the nearest shrunken
centroids classifier [38], the ‘naive Bayes’ classifier [39], and the features annealed independent
rule (FAIR) [40]. These methods are based on the independence rule that ignores the
correlation among features. The nearest shrunken centroids classifier and the ‘naive Bayes’
classifier use only the diagonal of the sample covariance to estimate $\Sigma$ while the FAIR method
conducts feature selection based on the marginal t-statistics in two-sample t tests. Although
these classifiers are easy to interpret and computationally attractive, they may lose critical
covariance information and hence may be suboptimal. Strong correlations can exist in high-dimensional
data and ignoring them may lead to misleading feature selection. In particular,
Mai and Zou [21] pointed out that variable selection can be inconsistent when the correlation
is ignored; moreover, as the sample size goes to infinity, the Bayes risk may not be achieved
in this case.

In recent years, many efforts have been devoted to developing sparse versions of LDA,
such as the $\ell_1$-Fisher’s discriminant analysis (FSDA) [41], the regularized optimal affine
discriminant (ROAD) [42], the penalized LDA methods [43], the direct sparse discriminant
analysis (DSDA) [21] and the sparse optimal scoring (SOS) [44]. Moreover, Mai and Zou
[45] revealed the connection and equivalence between FSDA, DSDA and SOS.

Although sparse LDA has been a popular research topic, to the best of our knowledge, little
progress has been made to generalize the LDA method to a scenario with many unlabeled
data, a gap which the current article intends to fill.

2.2 Proposed Method 

Consider a binary classification problem with the labeled data $\{(x_i, y_i),\ i = 1, \dots, n_l\}$ and
the unlabeled data $\{x_j,\ j = 1, \dots, n_u\}$, which are indexed by $i = n_l + 1, \dots, n_l + n_u$ in the
sums below. The total sample size is $n = n_l + n_u$. Our goal is
to find a linear classification function of the form $f(x) = \omega' x + b$ to classify a partially labeled
dataset, by solving the following optimization problem,

$$\min_{\omega, b}\ C_1 \sum_{i=1}^{n_l} L(y_i, f(x_i)) + C_2 \sum_{i=n_l+1}^{n_l+n_u} U(f(x_i)) + p(\omega, b), \qquad (1)$$

where $L(\cdot)$ is a loss function for the labeled data to control the misclassification, and $U(\cdot)$ is
a loss function for the unlabeled data to encourage a large margin between two clusters. As
usual, $p(\cdot)$ is a penalty term which controls the model complexity. The non-negative tuning
parameter pair $C = (C_1, C_2)$ balances the trade-off among the misclassification, the large
margin between clusters and the model complexity.

Problem (1) is a general framework for obtaining the linear classification function, in which many
different loss functions and penalty functions may be used for $L(\cdot)$, $U(\cdot)$ and $p(\cdot)$. Examples
of $L(\cdot)$ include, among others, the logistic loss $L(y, f) = \log(1 + e^{-yf})$ [46]; the hinge loss
$L(y, f) = (1 - yf)_+$ for SVM with its variants $L(y, f) = (1 - yf)_+^q$ for $q > 1$ [47]; and the $\psi$-loss
$\psi(y, f) = 1 - \mathrm{sign}(yf)$ when $yf > 1$ or $yf < 0$, and $2(1 - yf)$ otherwise [6].

Though the squared error loss, defined as $L(\tilde{y}, f) = (\tilde{y} - f)^2$ (with $\tilde{y} = n/n_+$ when $y = +1$,
and $-n/n_-$ when $y = -1$), is more widely used in the regression setting, it can also be used
for classification. This is because the classical LDA method can be exactly reconstructed via
least squares regression (see Chapter 4 of [1]). In particular, let $\hat{\omega}^{\mathrm{OLS}}$ be the coefficient
vector obtained from the regression problem

$$(\hat{\omega}^{\mathrm{OLS}}, \hat{b}^{\mathrm{OLS}}) = \operatorname{argmin}_{\omega, b} \sum_{i=1}^{n} (\tilde{y}_i - b - \omega' x_i)^2,$$

where $\tilde{y}_i = n/n_+$ when $y_i = +1$, and $-n/n_-$ when $y_i = -1$. It can be verified that
$\hat{\omega}^{\mathrm{OLS}} = c\,\hat{\Sigma}^{-1}(\hat{\mu}_+ - \hat{\mu}_-)$ for some positive constant $c$, which is along the same direction as
the LDA coefficient vector $\hat{\Sigma}^{-1}(\hat{\mu}_+ - \hat{\mu}_-)$. In our proposed S$^3$LDA approach, we choose to
use the squared error loss, after coding $y \in \{+1, -1\}$ as $\tilde{y} \in \{n/n_+, -n/n_-\}$. The same loss
was previously considered by DSDA [21] for classification.
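
The equivalence between the least squares direction and the LDA direction can be checked numerically. The following is a small sketch on simulated data; the data-generating choices, seed and variable names are our own assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 6
y = rng.choice([-1, 1], size=n)
X = rng.standard_normal((n, d)) + np.outer(y, np.linspace(1.0, 0.0, d))

n_p, n_m = np.sum(y == 1), np.sum(y == -1)
y_tilde = np.where(y == 1, n / n_p, -n / n_m)   # recoded response n/n_+ or -n/n_-

# Least squares with an intercept: minimise sum_i (y_tilde_i - b - omega' x_i)^2.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y_tilde, rcond=None)
omega_ols = coef[1:]

# LDA direction: inverse pooled covariance times the sample mean difference.
Xp, Xm = X[y == 1], X[y == -1]
mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
S = ((Xp - mu_p).T @ (Xp - mu_p) + (Xm - mu_m).T @ (Xm - mu_m)) / (n - 2)
omega_lda = np.linalg.solve(S, mu_p - mu_m)

# Cosine of the angle between the two directions; equal to 1 up to rounding error.
cosine = omega_ols @ omega_lda / (np.linalg.norm(omega_ols) * np.linalg.norm(omega_lda))
```

The two vectors differ only by a positive scale factor, so a classifier built on either direction induces the same family of separating hyperplanes.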

The second term in (1), involving $U(\cdot)$, is associated with the unlabeled data, and is
included to encourage a large margin between two clusters induced by the classification rule.
This is done by assigning a large loss when the classification boundary goes through an
area with high density, hence encouraging the classification boundary to avoid those areas
and go through a gap between two clusters. In particular, $U(z)$ is a function with maximal
value at $z = 0$ and a decreasing value as $|z|$ increases. In order to have this property, we
can modify existing loss functions for classification, such as the hinge loss, the logistic loss
and the $\psi$-loss, by changing $yf(x)$ to $|f(x)|$ in their definitions. For example, the modified
logistic loss is $U(z) = \log(1 + e^{-|z|})$ and the modified hinge loss is $U(z) = (1 - |z|)_+$. That is,
we assign a zero loss when $|z| > 1$ and a loss of $1 - |z|$ otherwise. Wang and Shen [31] and
Wang et al. [32] also considered the modified hinge loss in their machine learning-oriented
methods for large-margin semi-supervised learning.


Although $U(\cdot)$ is applied to the unlabeled data only, we expect it to improve the classification
boundary by making the margin wider. For illustration purposes, in the rest of the
article, we use the modified hinge loss for $U(\cdot)$, but other choices for $U(\cdot)$ are possible.

In summary, in our proposed S$^3$LDA method, we combine a classical model-based approach,
LDA, and a machine learning-oriented method, to classify high-dimensional partially
labeled data. They are reflected in the choice of the loss functions $L$ and $U$ in (1). To be
specific, we choose $L(\tilde{y}, f(x)) = (\tilde{y} - f(x))^2$ and $U(f(x)) = (1 - |f(x)|)_+$.

2.3 Penalty Term 

Our penalty term $p(\omega, b)$ is chosen as $\|\omega\|_1 + c\|\tilde{\omega}\|^{-1}|b|$. The first term herein is an $\ell_1$ norm
penalty on $\omega$, to find variables on which the two classes are significantly different and shrink
the coefficients for the other variables to zero. Penalties other than the $\ell_1$ norm, such as
the elastic net, SCAD and MCP penalties, may be used as well.

The second term in the penalty function is included to prevent the following undesirable
case. When the parameters are not chosen wisely (such as when $C_1 = 0$), the
classification boundary may be pushed infinitely far away from
the data set, because this would induce a zero loss in $U(\cdot)$, with no cost on $L(\cdot)$. However,
the classification performance is clearly poor in this case since one class is totally
ignored. To avoid this issue, we assume, without loss of generality, that each predictor has
mean 0 and variance 1, and we include an adaptive penalty on the intercept term $b$, that is,
$c\|\tilde{\omega}\|^{-1}|b|$ as the second term of $p(\omega, b)$, to encourage a small value for $b$. Here $\tilde{\omega}$ is an
initial estimate of the classification coefficient vector $\omega$. Note that $b = 0$ indicates that the
classification boundary goes through the origin, in which case the undesirable situation
is avoided.


3 Implementation 

In this section, we discuss the implementation of our method. We first introduce the algorithm
for optimizing the unusual objective function in (1). We then discuss the problem of
tuning parameter selection.

3.1 Algorithm 

Solving the problem in (1) with the $U$ loss being the modified hinge loss involves a non-convex
optimization. To overcome this difficulty, we make use of the difference of convex
functions (DC) algorithm [48]. The key to the DC algorithm is to decompose the non-convex
objective function into the difference of two convex functions, which leads to a sequence
of convex optimizations. The sequence of local solutions converges to a stationary point.
The DC algorithm has been used in several other works to solve non-convex optimization
problems, such as [?] and [49]. For the sake of self-containment, we provide the brief idea
of the DC algorithm when applied to our method.

We first decompose the modified hinge loss $U = (1 - |z|)_+$ into the difference between
two convex functions $U_1 - U_2$, where $U_1 = (|z| - 1)_+$ and $U_2 = |z| - 1$. The decomposition
is displayed in Figure 1.
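
The decomposition can be verified numerically on a grid. The sketch below is our own illustration; the grid, the tolerance and the finite-difference convexity check are assumptions chosen for convenience.

```python
import numpy as np

def U(z):
    """Modified hinge loss (1 - |z|)_+."""
    return np.maximum(1.0 - np.abs(z), 0.0)

def U1(z):
    """First convex component (|z| - 1)_+."""
    return np.maximum(np.abs(z) - 1.0, 0.0)

def U2(z):
    """Second convex component |z| - 1."""
    return np.abs(z) - 1.0

z = np.linspace(-3.0, 3.0, 601)

# Largest pointwise discrepancy between U and U1 - U2 on the grid.
max_gap = np.max(np.abs(U(z) - (U1(z) - U2(z))))

def looks_convex(values, tol=1e-9):
    """Second differences on an equally spaced grid are non-negative for a convex function."""
    return bool(np.all(np.diff(values, 2) >= -tol))

u1_convex = looks_convex(U1(z))
u2_convex = looks_convex(U2(z))
u_convex = looks_convex(U(z))   # the modified hinge loss itself is not convex
```

The check confirms that the non-convexity of $U$ is carried entirely by the subtraction of $U_2$, which is the term the DC algorithm linearizes.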

Figure 1: The DC decomposition of $U = U_1 - U_2$. Functions $U$, $U_1$ and $U_2$ are represented
by the solid, dashed and dotted lines respectively.

Let $f(x) = \omega' x + b$. Rewrite the objective function as

$$Q = C_1 \sum_{i=1}^{n_l} L(y_i, f(x_i)) + C_2 \sum_{i=n_l+1}^{n_l+n_u} U(f(x_i)) + \|\omega\|_1 + c\|\tilde{\omega}\|^{-1}|b|,$$

which can be decomposed similarly as $Q_1 - Q_2$, where

$$Q_1 = C_1 \sum_{i=1}^{n_l} L(y_i, f(x_i)) + C_2 \sum_{i=n_l+1}^{n_l+n_u} U_1(f(x_i)) + \|\omega\|_1 + c\|\tilde{\omega}\|^{-1}|b|$$

and

$$Q_2 = C_2 \sum_{i=n_l+1}^{n_l+n_u} U_2(f(x_i)).$$

Note that both $Q_1$ and $Q_2$ are convex. To minimize the non-convex $Q_1 - Q_2$, we
use a linear approximation to $Q_2$, so that the approximated optimization problem is convex.
Overall, the algorithm is conducted in a three-step iteration as follows.

Step 1. Set the initial values $(\omega^{(0)}, b^{(0)})$ of $(\omega, b)$ to be the solutions of the sparse LDA
with labeled data alone. Set a precision tolerance level $\varepsilon > 0$.

Step 2. At the $(k+1)$st iteration, compute $(\omega^{(k+1)}, b^{(k+1)})$ by solving the convex problem

$$(\omega^{(k+1)}, b^{(k+1)}) = \operatorname{argmin}_{\omega, b}\ Q_1(\omega, b; \omega^{(k)}) - \big\langle (\omega', b)',\ \nabla Q_2|_{(\omega^{(k)}, b^{(k)})} \big\rangle,$$

where $Q_1(\omega, b; \omega^{(k)})$ is the result of substituting $\tilde{\omega}$ in $Q_1$ by $\omega^{(k)}$ from the previous iteration
and $\nabla Q_2|_{(\omega^{(k)}, b^{(k)})}$ is the gradient vector of $Q_2$ with respect to $(\omega', b)'$, evaluated at the
solution from the previous iteration.

Step 3. Repeat Step 2 until the convergence criterion falls below $\varepsilon$.

It was shown [21] that the number of iterations required to achieve the precision $\varepsilon$ is
$o(\log(1/\varepsilon))$.
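
As a worked step (our own expansion, assuming $f^{(k)}(x_i) \neq 0$ for every unlabeled point so that $U_2(z) = |z| - 1$ is differentiable there), the gradient used in the linear approximation of Step 2 can be written explicitly:

```latex
\nabla Q_2\big|_{(\omega^{(k)},\, b^{(k)})}
  = C_2 \sum_{i=n_l+1}^{n_l+n_u}
      \operatorname{sign}\!\big(f^{(k)}(x_i)\big)
      \begin{pmatrix} x_i \\ 1 \end{pmatrix},
\qquad
f^{(k)}(x) = \omega^{(k)\prime} x + b^{(k)} .
```

Under this expansion, the contribution of each unlabeled point to the convex subproblem is $C_2\{U_1(f(x_i)) - \operatorname{sign}(f^{(k)}(x_i))\, f(x_i)\}$, so the sign assigned by the previous iterate plays the role of a provisional label for that point.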

3.2 Tuning Parameter Selection 

The S$^3$LDA method has three tuning parameters $C_1$, $C_2$ and $c$. The pair $(C_1, C_2)$ is our
main focus, as it jointly controls the balance among the $L$ loss, the $U$ loss and the penalty.
A common practice in the literature is to conduct a grid search on a set of parameter
candidate values and compare their performance on an independent tuning data set. We
search $(C_1, C_2)$ over $\mathcal{C}_1 \times \mathcal{C}_2$, where $\mathcal{C}_1$ is a grid of powers of 2 and $\mathcal{C}_2$ consists of 0 together with a few powers of 10. We
include zero in $\mathcal{C}_2$ so that our method encompasses a sparse LDA method (DSDA) [21] which
uses the labeled data alone.

Our penalty term is $|\omega_1| + \cdots + |\omega_d| + c\|\tilde{\omega}\|^{-1}|b|$ with an ancillary parameter $c$. Note
that $\|\tilde{\omega}\|^{-1}|b|$ is approximately the distance from the classification boundary to the origin.
For the examples in the numerical study, with each variable being normalized to have
mean 0 and variance 1, we fix a universal value $c = 5$, which corresponds to a constraint
that the distance is less than a universal fixed value. This choice is only reasonable when
the data are normalized to have similar scales, which is the case in all our numerical studies.

Note that the tuning set is also partially labeled and the number of labeled data is very
limited. If we ignore the unlabeled data in the tuning set and compare the misclassification
rate for the labeled data only, the criterion may not be able to reflect the true goodness of
the classifier. The choice of criterion can be critical for tuning parameter selection in some
nontraditional situations, such as in imbalanced data classification problems [50, 51]. To
improve tuning with partially labeled data, we propose to use a new criterion which involves
two components: one is the number of misclassified observations among the
labeled data and the other is a clustering measure for both the labeled and unlabeled
data.

In particular, for each pair of $(C_1, C_2)$, we train a discriminant function, $f$. The number of
misclassified labeled data points can be easily counted. The clustering measure is defined as
the total number of tuning data points $x_i^{\mathrm{tune}}$, whether labeled or unlabeled, which fall into a
margin centered at the classification boundary with half-width $\eta$, that is, we count the number
of $|f(x_i^{\mathrm{tune}})| < \eta$. Here $\eta$ is a typical measure of the scale, which is defined as one quarter of
the sum of the 25th and 75th percentiles of the pairwise distance $|f(x_i^{\mathrm{tune}}) - f(x_{i'}^{\mathrm{tune}})|$ for all
$i \neq i'$. A small value of the clustering measure indicates that very few data points are close
to the classification boundary and hence the margin induced by $f$ is indeed wide.
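
The clustering measure can be sketched in a few lines. The function name `clustering_measure`, the use of NumPy's default (linearly interpolated) percentiles and the toy discriminant values are our own assumptions; the article does not specify these implementation details.

```python
import numpy as np

def clustering_measure(f_values):
    """Count tuning points with |f(x)| < eta, where eta is one quarter of the sum of
    the 25th and 75th percentiles of the pairwise distances |f(x_i) - f(x_i')|."""
    f = np.asarray(f_values, dtype=float)
    i, j = np.triu_indices(len(f), k=1)          # all unordered pairs with i < i'
    pair_dist = np.abs(f[i] - f[j])
    q25, q75 = np.percentile(pair_dist, [25, 75])
    eta = 0.25 * (q25 + q75)
    return int(np.sum(np.abs(f) < eta)), eta

# Discriminant values forming two well-separated clusters around -2 and +2.
separated = np.array([-2.1, -2.0, -1.9, 1.9, 2.0, 2.1])
# Discriminant values forming a single blob around the boundary f = 0.
overlapping = np.array([-0.3, -0.2, -0.1, 0.1, 0.2, 0.3])

count_sep, eta_sep = clustering_measure(separated)
count_ovl, eta_ovl = clustering_measure(overlapping)
```

Because $\eta$ is computed from the pairwise distances of the discriminant values themselves, the half-width rescales with $f$, so the measure compares candidate classifiers on a common footing.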

The choice of $\eta$ is a critical issue. Our choice of $\eta$ is adaptive to the underlying distribution
of the data. It ensures that a reasonable portion of the data have a fair chance to fall in the
gap, which helps to identify the optimal tuning parameter pair.

4 Theoretical Property 

In this section, we provide several theoretical justifications of the S$^3$LDA method. The
classical LDA method is based on the Gaussian assumption. In this setting, without loss of
generality, we assume that the two classes have means opposite to each other with respect
to the origin. We further assume that the prior probabilities are the same. In this case,
the Bayes classification boundary passes through the origin since $f^{\mathrm{Bayes}}(x) = x'\omega^{\mathrm{Bayes}} + b^{\mathrm{Bayes}}$
with $b^{\mathrm{Bayes}} = 0$. We have the following two propositions that describe the theoretical
minimizers of risk functions with respect to the $L$ loss and the $U$ loss.

Proposition 1. Assume that $X \mid Y = +\delta \sim N(\mu, \Sigma)$, $X \mid Y = -\delta \sim N(-\mu, \Sigma)$ with $\Sigma$ full
rank and $P(Y = +\delta) = P(Y = -\delta) = 1/2$. Let

$$\omega_1 = \operatorname{argmin}_{\omega} E(Y - \omega' X)^2$$

be the theoretical minimizer of the risk function with respect to the squared error loss, where
the expectation is taken with respect to the distribution of $(X, Y)$. Then $\omega_1 \propto \Sigma^{-1}\mu$.
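
A short derivation sketch of Proposition 1 follows. It is our own expansion, based on the normal equations of the population least squares problem and the Sherman-Morrison formula, and is not necessarily the argument used in the Appendix.

```latex
\begin{align*}
E(XY)  &= \tfrac{1}{2}\,\delta\,\mu + \tfrac{1}{2}\,(-\delta)(-\mu) = \delta\mu, \\
E(XX') &= \Sigma + \mu\mu', \\
\omega_1 &= \{E(XX')\}^{-1} E(XY)
          = \delta\,(\Sigma + \mu\mu')^{-1}\mu
          = \frac{\delta}{1 + \mu'\Sigma^{-1}\mu}\,\Sigma^{-1}\mu
          \;\propto\; \Sigma^{-1}\mu .
\end{align*}
```

The last equality uses $(\Sigma + \mu\mu')^{-1}\mu = \Sigma^{-1}\mu / (1 + \mu'\Sigma^{-1}\mu)$, which holds because $\Sigma$ has full rank.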

Proposition 2. Under the same assumption as Proposition 1, let $\omega_2$ be the theoretical
minimizer of the risk function with respect to the modified hinge loss, with the linear normalization
constraint that $\omega'\mu = 1$,

$$\omega_2 = \operatorname{argmin}_{\omega:\ \omega'\mu = 1} E(1 - |\omega' X|)_+,$$

where the expectation is taken with respect to the marginal distribution of $X$. Then
$\omega_2 \propto \Sigma^{-1}\mu$.

Proposition 1 and Proposition 2 show that the theoretical minimizers of the risk functions
for the $L$ loss and the $U$ loss in our S$^3$LDA method both equal
the Bayes coefficient vector, up to some multiplicative constants. Theorem 1 shows
that the theoretical coefficient vector of the unpenalized population version of our S$^3$LDA
classifier is along the same direction as the Bayes coefficient vector.

Theorem 1. Assume that $X \mid Y = +\delta \sim N(\mu, \Sigma)$, $X \mid Y = -\delta \sim N(-\mu, \Sigma)$, where
$\delta = 1/(\mu'\Sigma^{-1}\mu)$, and $P(Y = +\delta) = P(Y = -\delta) = 1/2$. Let $\omega_\infty$ be the theoretical
coefficient vector of the unpenalized population version of the S$^3$LDA classifier, defined by

$$\omega_\infty = \operatorname{argmin}_{\omega:\ \omega'\mu = 1} E(Y - \omega' X)^2 + C\,E(1 - |\omega' X|)_+, \qquad (2)$$

where $C > 0$ is a constant. Then $\omega_\infty = \Sigma^{-1}\mu / (\mu'\Sigma^{-1}\mu)$.

To gain better insight, we reformulate the $\ell_1$ penalized S$^3$LDA classifier. Theorem 2 further
reveals the small difference between the $\ell_1$ penalized version of the S$^3$LDA discriminant
direction vector and the Bayes optimal direction vector.

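Theorem 2. Let $s$ be the size of the set $K := \{k : \omega_{\infty,k} \neq 0\}$, and let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$
be the greatest and the smallest eigenvalues of a matrix $A$. Assume that $X \mid Y = +\delta \sim N(\mu, \Sigma)$,
$X \mid Y = -\delta \sim N(-\mu, \Sigma)$, where $\delta = 1/(\mu'\Sigma^{-1}\mu)$, and $P(Y = +\delta) = P(Y = -\delta) = 1/2$. Let
$\omega_\infty$ be as in (2) and $\omega_\lambda$ correspond to the penalized version,

$$\omega_\lambda = \operatorname{argmin}_{\omega:\ \omega'\mu = 1} E(Y - \omega' X)^2 + C\,E(1 - |\omega' X|)_+ + \lambda\|\omega\|_1, \qquad (3)$$

where $C > 0$ is the same constant as in (2). Then

$$\|\omega_\lambda - \omega_\infty\|_2 \le \frac{\lambda\sqrt{s} + C\sqrt{\lambda_{\max}(\tilde{\Sigma})}}{\lambda_{\min}(\tilde{\Sigma})},$$

where $\tilde{\Sigma} = \Sigma + \mu\mu'$.

Theorem 2 characterizes the difference between the S$^3$LDA solution and the Bayes rule.
This difference is jointly controlled by the parameters $\lambda$ and $C$. When $C = 0$, the problem
reduces to the $\ell_1$-LDA applied to the labeled data only, in which case $\|\omega_\lambda - \omega_\infty\|_2 \to 0$
as $\lambda \to 0$. Similar results can be seen for the ROAD classifier [42].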






5 Numerical Study 


In this section we use simulated and real data examples to demonstrate the effectiveness
of our proposed method. We compare it with linear $\ell_1$-SVM and a sparse LDA method
(DSDA) [21], both of which are applied to the labeled data only, and a semi-supervised
method, SELF [36]. As a benchmark comparison, we also apply these methods to the full
data set with all the labels available in order to see how much room for improvement remains for S$^3$LDA.
The misclassification rate, averaged over 100 replications, is reported.

5.1 Simulations 

We consider four simulated examples. In each example, we first generate an i.i.d. random
sample $\{(x_i, y_i),\ i = 1, \dots, n^*\}$ where $x_i = (x_{i1}, \dots, x_{id})'$ and $d$ and $n^*$ vary, depending on
the setting. Then we partition the data into training, tuning and test data sets. Importantly,
we intentionally remove some class labels in the training and tuning sets to create a partially
labeled data setting.

In Example 1 and Example 2, 3400 independent instances are generated. We use 200 for 
training, 200 for tuning and the remaining 3000 for testing. 

Example 1: $(Y_i + 1)/2 \sim \mathrm{Bernoulli}(0.5)$, $X_{i1} \sim N(1.4\,Y_i, 1)$, and $X_{i2} \sim N(0, 1)$,
$i = 1, \dots, 3400$. Among the 200 training instances, 190 unlabeled instances $(X_{i1}, X_{i2})$ are
obtained by removing labels from a randomly chosen subset of the training sample, whereas
the remaining 10 instances are treated as the labeled data. The 200 tuning instances are
processed in the same way as the training set.

Example 2: $(Y_i + 1)/2 \sim \mathrm{Bernoulli}(0.5)$, $X_{i1} \sim N(sY_i, 1)$, $X_{i2} \sim N(-sY_i, 1)$ and
$X_{i3}, \dots, X_{i100} \sim N(0, 1)$, $i = 1, \dots, 3400$. Here $s$ indicates the signal level and varies from 1
to 2.5. Among the 200 training instances, 190 unlabeled instances are obtained by removing
labels from a randomly chosen subset of the training sample, whereas the remaining 10
instances are treated as the labeled data. The 200 tuning instances are processed in the
same way as the training set.
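
The design of Example 2 can be reproduced with a short simulation sketch. The function names, the seed, and the convention of marking an unlabeled observation with the label 0 are our own assumptions for illustration and are not taken from the article.

```python
import numpy as np

def simulate_example2(n=3400, d=100, s=1.3, seed=0):
    """Draw (X, y) as in Example 2; also returns the generator for later reuse."""
    rng = np.random.default_rng(seed)
    y = 2 * rng.binomial(1, 0.5, size=n) - 1     # (Y + 1)/2 ~ Bernoulli(0.5)
    X = rng.standard_normal((n, d))              # every coordinate starts as N(0, 1)
    X[:, 0] += s * y                             # X_{i1} ~ N(s Y_i, 1)
    X[:, 1] -= s * y                             # X_{i2} ~ N(-s Y_i, 1)
    return X, y, rng

def hide_labels(y, n_labeled, rng):
    """Keep n_labeled randomly chosen labels; mark the rest as unlabeled with 0."""
    y_partial = np.zeros_like(y)
    keep = rng.choice(len(y), size=n_labeled, replace=False)
    y_partial[keep] = y[keep]
    return y_partial

X, y, rng = simulate_example2()
# 200 training and 200 tuning instances with 10 labels each; 3000 test instances.
X_train, y_train = X[:200], hide_labels(y[:200], 10, rng)
X_tune, y_tune = X[200:400], hide_labels(y[200:400], 10, rng)
X_test, y_test = X[400:], y[400:]
```

With only 10 labeled observations in 100 dimensions, the labeled training data alone form an HDLSS problem, while the 190 unlabeled observations still carry information about the two clusters.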

In Example 3 and Example 4, we generate 10000 instances and study the performance
of the methods as the dimensionality increases. In particular, we simulate datasets
with dimension $d = 20, 30, 40, 50, 100, 200, 500$. The number of training instances equals
200, 244, 283, 316, 447, 632, 1000 respectively, which increases at the rate of $\sqrt{d}$. The tuning
sample size is the same as the training sample size in each case, and the remaining instances
are used for testing.

Example 3: $(Y_i + 1)/2 \sim \mathrm{Bernoulli}(0.5)$, $X_{i1}, X_{i3}, X_{i5}, X_{i7}, X_{i9} \sim N(Y_i/2.7, 1)$, and
$X_{i2}, X_{i4}, X_{i6}, X_{i8}, X_{i10} \sim N(-Y_i/2.7, 1)$. The covariance of $X_{i1}, \dots, X_{i10}$ is the Toeplitz matrix
with the first row $(1, 0.8, 0.8^2, \dots, 0.8^9)$. $X_{i11}, \dots, X_{i100} \sim N(0, 1)$, $i = 1, \dots, 10000$. We
randomly select 10 labeled observations from each class, and the remaining observations are treated as unlabeled
data for both training and tuning sets.

Example 4: $(Y_i + 1)/2 \sim \mathrm{Bernoulli}(0.5)$, $X_{i1}, X_{i3}, X_{i5}, X_{i7}, X_{i9} \sim \sqrt{3/5} \times t(5) + Y_i/2.7$,
and $X_{i2}, X_{i4}, X_{i6}, X_{i8}, X_{i10} \sim \sqrt{3/5} \times t(5) - Y_i/2.7$. The covariance of $X_{i1}, \dots, X_{i10}$ is
the Toeplitz matrix with the first row $(1, 0.8, 0.8^2, \dots, 0.8^9)$. $X_{i11}, \dots, X_{i100} \sim N(0, 1)$,
$i = 1, \dots, 10000$. We randomly select 10 labeled observations from each class, and the remaining observations are
treated as unlabeled data for both training and tuning sets.

Data                 $\ell_1$-SVM$_l$  $\ell_1$-LDA$_l$  SELF      S$^3$LDA  S$^3$LDA(o)  $\ell_1$-SVM$_c$  $\ell_1$-LDA$_c$  Bayes
Example 1            0.099             0.096             0.096     0.094     0.084        0.082             0.082             0.080
                     (0.0020)          (0.0019)          (0.0015)  (0.0021)  (0.0008)     (0.0003)          (0.0002)
Example 2 (s = 1.3)  0.088             0.080             0.109     0.075     0.056        0.052             0.065             0.033
                     (0.0033)          (0.0032)          (0.0024)  (0.0034)  (0.0018)     (0.0030)          (0.0007)

Table 1: Simulation results for Examples 1 and 2: misclassification rate with standard
error for 100 replications for S$^3$LDA, SELF, $\ell_1$-LDA and $\ell_1$-SVM. Subscript $l$ indicates the
results for labeled data only. Subscript $c$ indicates the results for complete data with label
information available. S$^3$LDA(o) indicates the oracle solution of S$^3$LDA. The Bayes error is
also provided. The results for Example 1 show that S$^3$LDA performs slightly better
than $\ell_1$-LDA, $\ell_1$-SVM and SELF, with a greater potential (the error for S$^3$LDA(o) is 0.084, which
is close to the complete data errors and the Bayes error). S$^3$LDA also performs well in an
HDLSS setting such as Example 2. The table shows the result for $s = 1.3$. The S$^3$LDA oracle
solution is even better than the complete data $\ell_1$-LDA and close to the complete data
$\ell_1$-SVM. SELF does not perform well in this example, even worse than methods using the
labeled data only, possibly due to the restrictions of graph-based methods.

In addition to the S$^3$LDA and SELF methods for partially labeled data, and the $\ell_1$-SVM
and the sparse LDA method for both the labeled data only and the full data, we report the
theoretical Bayes error as a baseline for comparison. We also show the performance of the
oracle solutions to S$^3$LDA and $\ell_1$-LDA on the solution path, which are obtained by selecting
the optimal parameters based on the performance on the whole test data set. The oracle
solution (denoted as '(o)' in tables and figures) can be viewed as the best possible result for
a method on the whole solution path, assuming that the tuning parameter selection method could
find the truly optimal parameters. It indicates the potential of a method.
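
The difference between a tuned solution and the oracle solution described above can be illustrated as follows. This is a sketch under assumed inputs: the error values are made up for illustration and the helper `tuned_and_oracle` is hypothetical; each array position corresponds to one tuning parameter value on the solution path.

```python
import numpy as np

def tuned_and_oracle(tuning_err, test_err):
    """Return (test error of the tuned solution, test error of the oracle solution)."""
    tuning_err = np.asarray(tuning_err, dtype=float)
    test_err = np.asarray(test_err, dtype=float)
    tuned_idx = int(np.argmin(tuning_err))   # parameter chosen on the tuning set
    oracle_idx = int(np.argmin(test_err))    # best parameter in hindsight, chosen on the test set
    return test_err[tuned_idx], test_err[oracle_idx]

# Illustrative (made-up) errors along a path of four tuning parameter values.
tuning = [0.15, 0.10, 0.12, 0.20]
test = [0.14, 0.11, 0.09, 0.18]
tuned, oracle = tuned_and_oracle(tuning, test)
```

By construction the oracle error can never exceed the tuned error, which is why it is read as a measure of potential rather than of achievable performance.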

Example 1 is a low-dimensional study, where the sample size of the labeled data is greater
than the dimension. Examples 2, 3 and 4 are based on the HDLSS setting. In Example 2,
we fix the dimensionality and study the performance as the signal strength
$s$ changes. Example 3 and Example 4 demonstrate the performance of S$^3$LDA as the
dimensionality increases. We explore a Gaussian case in Example 3 and a non-Gaussian case in
Example 4.

Data           $\ell_1$-SVM$_l$  $\ell_1$-LDA$_l$  S$^3$LDA  S$^3$LDA(o)  $\ell_1$-SVM$_c$  $\ell_1$-LDA$_c$
Example 1  FP  0.09/1            0.28/1            0.64/1    0.45/1       0.03/1            0.81/1
               (0.024)           (0.037)           (0.039)   (0.041)      (0.015)           (0.032)
           FN  0/1               0/1               0/1       0/1          0/1               0/1
               (0)               (0)               (0)       (0)          (0)               (0)
Example 2  FP  74.6/98           3.0/98            8.1/98    3.9/98       16.0/98           63.5/98
               (0.27)            (0.12)            (1.11)    (0.16)       (2.24)            (0.46)
           FN  0.48/2            0.20/2            0.11/2    0.05/2       0/2               0/2
               (0.042)           (0.034)           (0.026)   (0.018)      (0)               (0)

Table 2: Simulation results for Examples 1 and 2: average number of false positives (FP)
and false negatives (FN) with standard error for 100 replications. Although S$^3$LDA is shown
to have more false positives than the labeled data $\ell_1$-SVM and $\ell_1$-LDA in Example 1, the
complete data $\ell_1$-LDA is actually the worst. In Example 2, S$^3$LDA and S$^3$LDA(o) have
far fewer false positives than both $\ell_1$-SVM methods and the complete data $\ell_1$-LDA, while
they have fewer false negatives than both labeled-data-only methods.

For Examples 1 and 2, the misclassification rates of S$^3$LDA and other methods are shown
in Table 1. The numbers of false positives and false negatives are shown in Table 2. A false
positive occurs when a zero variable has a nonzero coefficient and a false negative occurs
when a true variable has a zero coefficient value. As SELF is not a method designed for
variable selection, we do not consider its false positives and false negatives.
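
The two counts defined above can be computed directly from an estimated coefficient vector. The sketch below is illustrative only: the helper `count_fp_fn`, the tolerance argument and the example coefficients are assumptions, not part of the original study.

```python
import numpy as np

def count_fp_fn(coef, true_support, tol=0.0):
    """Count false positives and false negatives of a sparse coefficient vector."""
    coef = np.asarray(coef, dtype=float)
    is_true = np.zeros(coef.shape[0], dtype=bool)
    is_true[list(true_support)] = True           # variables that truly carry signal
    selected = np.abs(coef) > tol                # variables with a nonzero coefficient
    fp = int(np.sum(selected & ~is_true))        # zero variables that were selected
    fn = int(np.sum(~selected & is_true))        # true variables that were missed
    return fp, fn

# Hypothetical fit: variables 0 and 1 are the true ones; variable 3 is wrongly selected.
coef = [0.7, 0.0, 0.0, -0.2, 0.0]
fp, fn = count_fp_fn(coef, true_support=[0, 1])
```

In this toy case the fit has one false positive (variable 3) and one false negative (variable 1).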

The results for Example 1 in Table 1 show that S$^3$LDA performs slightly better
than the semi-supervised method (SELF) and than $\ell_1$-LDA and $\ell_1$-SVM (when they are
applied to the labeled data), and that it has great potential (the error for the oracle solution
of S$^3$LDA is 0.084, which is close to the complete data errors and the Bayes error). In Table
2, although S$^3$LDA is shown to have more false positives than the labeled data $\ell_1$-SVM and
$\ell_1$-LDA, the complete data $\ell_1$-LDA is actually the worst (with 0.81 out of 1 in Example
1), suggesting that the problem could be due to the ineffectiveness of the LDA methods for
variable selection in such low-dimensional data.

S$^3$LDA also performs well in the HDLSS setting such as Example 2. Tables 1 and 2 show
the results for the case $s = 1.3$ in Example 2. S$^3$LDA again performs slightly better than the
labeled data $\ell_1$-LDA and $\ell_1$-SVM, with the potential to be even better than the complete data
$\ell_1$-LDA (the error for the S$^3$LDA oracle is 0.056, which is less than the error for $\ell_1$-LDA$_c$, 0.065).
This can possibly be explained by its variable selection performance, shown in the bottom
half of Table 2. S$^3$LDA and S$^3$LDA(o) have far fewer false positives than the $\ell_1$-SVM
(for both the complete data and the labeled data only) and the complete data $\ell_1$-LDA, while
they have fewer false negatives than both labeled-data-only methods. SELF again does not
perform well in this example, even worse than methods using the labeled data only. This
may be explained by the restriction of graph-based methods such as SELF. Graph-based
methods often rely on the adjacency matrix, which describes the neighboring relationship
between data points. Such a method encourages neighbors to be classified to the same class.
Thus they are particularly useful when the signal is large, that is, when two clusters can be easily
identified. However, in Example 2, the two classes almost overlap with each other so that
the adjacency matrix is not very helpful for classification. As the signal increases, SELF
indeed gets better performance (see Figure 2).

[Figure 2 plot: Misclassification Rate (Example 2). Legend: S3LDA, S3LDA (o), L1 LDA (labeled),
L1 LDA (complete), L1 LDA (o), L1 SVM (labeled), L1 SVM (complete), SELF, Bayes.]

Figure 2: Simulation results for Example 2 with various $s$ values. The S$^3$LDA, S$^3$LDA(o) and
Bayes solutions are plotted in bold lines. It can be seen that S$^3$LDA (red line) does
not perform well when $s$ is relatively small ($s = 1.0, 1.1, 1.2$) but performs better than both
labeled-data-only methods for larger $s$, and even better than both complete data methods for
$s = 1.8, 2.0, 2.5$. The oracle solution (dark red line) is always better than the complete data
$\ell_1$-LDA and even its oracle version, which indicates a great potential of S$^3$LDA even when $s$
is relatively small. SELF does not perform well in this example possibly because the signal
between the two classes is too small. As $s$ increases from 1.4 to 2.5, the misclassification rate
of SELF dramatically decreases, which may suggest that it relies heavily on the signal level.

We also study the performance for Example 2 with changing signal strength $s$, shown
in Figure 2. We consider $s \in \{1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.8, 2.0, 2.5\}$. Different lines and
symbols represent different methods. The S$^3$LDA, S$^3$LDA(o) and Bayes solutions are plotted
in bold lines. It can be seen that S$^3$LDA (red line) does not perform well when $s$ is
relatively small ($s = 1.0, 1.1, 1.2$) but performs better than both labeled-data-only methods
for $s = 1.3, 1.4, 1.5, 1.8, 2.0, 2.5$, and even better than both complete data methods for
$s = 1.8, 2.0, 2.5$. The oracle solution (dark red line) is always better than the complete data
$\ell_1$-LDA and its oracle solution, which indicates a great potential of S$^3$LDA even when $s$ is
relatively small. As discussed, SELF does not perform well in this example possibly because
the signal between the two classes is too small. Figure 2 shows that when $s$ increases from
1.4 to 2.5, the misclassification rate of SELF dramatically decreases, which suggests that it
relies heavily on the signal level.

[Figure 3 plot: two panels, each titled "Projection on Bayes Direction". Legend: x positive class,
o negative class.]

Figure 3: The training data points in Example 2 projected onto the Bayes direction vector, for
the cases $s = 1$ (left) and $s = 1.5$ (right). The positive and negative classes are represented
in red and blue respectively. The x-axes show the projected values and the y-axes show
random jitters for better visualization. Kernel density estimates are displayed as the curves.

To understand why S$^3$LDA does not work well for Example 2 when the signal is
small but works well for larger $s$, we plot the training data points for $s = 1$ and $s = 1.5$, after
projecting them onto the Bayes direction vector, in Figure 3. In the right panel ($s = 1.5$), there
is a valley in the density curve; in the left panel ($s = 1$), even with the theoretically best
Bayes direction, the gap between the two classes is invisible. For this reason, the unlabeled
data fail to boost the classification performance for S$^3$LDA, as no single coefficient direction
can provide a wider margin than the others.
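
The appearance of a valley can be related to how far apart the projected class means are. The sketch below is a toy illustration, not the Example 2 design: the mean vector `base`, the identity covariance and the helper `projected_separation` are assumptions. It relies on the standard fact that an equal-weight mixture of two normal densities with a common standard deviation has a valley only when the means are more than two standard deviations apart.

```python
import numpy as np

def projected_separation(mu, cov):
    """Gap between the projected class means, in units of the projected standard deviation."""
    mu = np.asarray(mu, dtype=float)
    w = np.linalg.solve(cov, mu)      # Bayes direction, proportional to Sigma^{-1} mu
    gap = 2.0 * (w @ mu)              # projected class means are +w'mu and -w'mu
    sd = np.sqrt(w @ cov @ w)         # common standard deviation after projection
    return gap / sd

cov = np.eye(3)
base = np.array([0.8, 0.0, 0.0])      # hypothetical mean direction, scaled by s below
for s in (1.0, 1.5):
    sep = projected_separation(s * base, cov)
    print(s, round(sep, 2), "valley" if sep > 2 else "no valley")
```

In this toy setting the separation crosses the two-standard-deviation threshold between $s = 1$ and $s = 1.5$, mirroring the qualitative change between the two panels.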

The misclassification rates for Examples 3 and 4 are displayed in Figure 4. The
result for Example 3 is shown in the left panel and the result for Example 4 is shown in
the right panel. We explore a Gaussian case in Example 3 and a non-Gaussian case ($t$
distribution) in Example 4. Different methods are drawn with different lines. The x-axis
shows the dimensionality of the data. For dimensions up to 100, S$^3$LDA (solid line) shows a
great improvement, for both Gaussian and non-Gaussian cases, over $\ell_1$-LDA and $\ell_1$-SVM using
labeled data alone, and over SELF (dashed lines), with an even greater potential (the oracle
solution of S$^3$LDA, shown in black bold line, is even better). SELF again fails to perform
better than labeled data $\ell_1$-LDA and $\ell_1$-SVM in these two examples. The complete data
$\ell_1$-SVM, complete data $\ell_1$-LDA and the Bayes error are also provided (all are close to zero).
S$^3$LDA does not provide much improvement for $d = 200$ and $d = 500$, in which cases the
ratio of the dimensionality to the number of labeled data may be too large.

[Figure 4 plot: misclassification rate against dimension for Example 3 (left) and Example 4 (right).
Legend: S3LDA, S3LDA (o), L1 LDA (labeled), L1 LDA (complete), L1 SVM (labeled),
L1 SVM (complete), SELF, Bayes.]

Figure 4: Simulation results for Examples 3 and 4: misclassification rate over 100 replications
for S$^3$LDA, $\ell_1$-LDA, $\ell_1$-SVM and SELF. The result for Example 3 is shown in the left panel
and the result for Example 4 is shown in the right panel. For dimensions up to 100, S$^3$LDA
(solid line) shows a great improvement over $\ell_1$-LDA and $\ell_1$-SVM using labeled data alone, and over
SELF (dashed lines), with more potential (the oracle solution of S$^3$LDA, shown in bold line,
is even better). S$^3$LDA does not provide much improvement for $d = 200$ and $d = 500$, in
which cases the ratio of the dimensionality to the number of labeled data may be too large.

In Figure 5, we compare the $\log_{10}$ false positives and the false negatives for Examples
3 and 4, over 100 replications, for S$^3$LDA with those of the $\ell_1$-SVM and $\ell_1$-LDA for the
labeled data only. The false positives, after taking the logarithm to the base 10, are shown in
the top row and the false negatives are shown in the bottom row. While S$^3$LDA performs
poorly regarding the false positives, it has fewer false negatives in both examples.

[Figure 5 plot: panels for Example 3 and Example 4; x-axis: Dimension (20, 30, 40, 50, 100, 200, 500).
Legend: S3LDA, L1 LDA (labeled), L1 SVM (labeled).]

Figure 5: Simulation results for Examples 3 and 4: the $\log_{10}$ false positives (top) and the
false negatives (bottom) over 100 replications for S$^3$LDA (solid line), $\ell_1$-LDA (dashed line)
and $\ell_1$-SVM (dotted line) using labeled data only. Both are shown with the increase of
dimensionality. While S$^3$LDA performs poorly in terms of the false positives, it has fewer
false negatives in both examples.

5.2 Real Data Application 

In this section, we analyze the Human Lung Carcinomas Microarray Data set using the
S$^3$LDA method. This data set was previously analyzed in [52]. Liu et al. [53] used this data set
as a test bed to demonstrate their proposed significance analysis of clustering approach. The
original data contain 12,625 genes. We have filtered the genes using the ratio of the sample
standard deviation to the sample mean of each gene and keep 2,530 of them with large ratios
[51][53]. We apply different methods to compare their classification performance.

The original Human Lung Carcinomas Data contain six classes. We combine the Squamous,
SmallCell and Normal subclasses to form the new positive class with a sample size of
40, and combine the Colon and Carcinoid subclasses to form the new negative class with a
sample size of 37. Among the total of 77 observations, we randomly select 28 observations
to keep the label information, 14 from each class, and treat the remaining 49 observations
as unlabeled. The 28 labeled observations are split into 12 and 16 for training and tuning
respectively. All the unlabeled data are used for training. We do not include the unlabeled
data in the tuning criterion for this real data example, since the total amount of data is very
limited and all the unlabeled data are used in training. We repeat the above procedure 100
times and report the average test errors on the unlabeled data in Table 3.
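
One replication of this splitting scheme might look as follows. This is a sketch under assumptions: the helper `split_labeled_unlabeled` is hypothetical, labels are coded as $\pm 1$, and the 12/16 split of the labeled observations is drawn at random without stratification, since the text does not say how it is balanced across classes.

```python
import numpy as np

def split_labeled_unlabeled(y, n_labeled_per_class=14, n_train=12, seed=0):
    """Return index arrays (labeled training, labeled tuning, unlabeled)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    # Keep the labels of 14 randomly chosen observations from each class.
    labeled = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_labeled_per_class, replace=False)
        for c in (1, -1)
    ])
    labeled = rng.permutation(labeled)
    train_idx, tune_idx = labeled[:n_train], labeled[n_train:]   # assumed: unstratified 12/16 split
    unlabeled_idx = np.setdiff1d(np.arange(y.size), labeled)     # remaining observations
    return train_idx, tune_idx, unlabeled_idx

# Class sizes as in the text: 40 positive and 37 negative observations.
y = np.array([1] * 40 + [-1] * 37)
train_idx, tune_idx, unlabeled_idx = split_labeled_unlabeled(y)
```

Repeating the call with different seeds would give the independent replications over which the errors are averaged.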



             $\ell_1$-SVM$_l$  $\ell_1$-LDA$_l$  SELF    S$^3$LDA  $\ell_1$-SVM$_c$  $\ell_1$-LDA$_c$
Error (%)    17.77             14.75             16.77   7.49      0.31              0
(SE (%))     (1.10)            (0.91)            (0.98)  (0.74)    (0.12)            (0)

Table 3: Averaged test errors, with the estimated standard errors in parentheses, on
a more complicated data set over 100 independent replications of S$^3$LDA, SELF, and the $\ell_1$-penalized
SVM and $\ell_1$-penalized LDA with labeled data only and with complete data respectively.

In this challenging setting where both classes contain subclasses, S$^3$LDA works quite
well, with 92.51% classification accuracy. It outperforms labeled data $\ell_1$-SVM, labeled data
$\ell_1$-LDA and SELF, with 12.50%, 8.83% and 11.15% improvements over the three methods
respectively. Note that here the two classes are not as well separated from each other, since both
contain several possibly heterogeneous subclasses.

6 Conclusion 

In this article, we propose a semi-supervised sparse Fisher's linear discriminant analysis
method in the HDLSS setting. This method is designed for a dataset where a small amount
of labeled data is available along with a large amount of unlabeled data. In contrast to methods
which rely on labeled data only, our method makes use of the unlabeled data to reconstruct
the classification boundary. This is done by discouraging the boundary from passing through an
area with high density.

For parameter tuning, we incorporate both the labeled and unlabeled data to identify
the parameters with optimal performance. Our method outperforms $\ell_1$-LDA and $\ell_1$-SVM
using labeled data alone in situations where the two classes have small overlap. Otherwise,
S$^3$LDA performs as well as the $\ell_1$-LDA (for the labeled data only). In our numerical study,
we often see great potential of S$^3$LDA through the competitive performance of the oracle
solution. While in reality it is difficult to find the theoretically best tuning parameters with
the very limited data, future research should focus on improved parameter tuning
criteria for partially labeled data.

The $\ell_1$ norm penalty is used to handle the HDLSS data. Other penalty terms, such as
the elastic net, SCAD and MCP penalties, are possible as well. One possible reason that
S$^3$LDA does not perform well in terms of false positives is that the zero variables in the
unlabeled data add noise that interferes with the variable
selection goal.

S$^3$LDA combines the classical model-based linear discriminant analysis and a
machine-learning-oriented technique. The LDA component fully uses the model assumptions and the
covariance structure; the U loss boosts the performance by extracting useful information from
the unlabeled data. The proposed method enriches the capacity of the LDA method
and extends it to the partially labeled data territory. A related topic for future work is the
significance analysis for partially labeled data [55].

Appendix

Proof of Proposition 1


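Let the loss be $L(y, u) = (y - u)^2$. Since we assume the separating hyperplane goes through
the origin, we consider

$$
\begin{aligned}
Q(\omega) &= E[L(Y, \omega'X)] = E[(Y - \omega'X)^2] \\
&= P(Y = +\delta)\,E_{X|+\delta}[(Y - \omega'X)^2] + P(Y = -\delta)\,E_{X|-\delta}[(Y - \omega'X)^2] \\
&= \frac{1}{2}\int (\delta - \omega'x)^2 \phi_+(x)\,dx + \frac{1}{2}\int (-\delta - \omega'x)^2 \phi_-(x)\,dx,
\end{aligned}
$$

where $\phi_+(x)$ and $\phi_-(x)$ are the density functions of $N(\mu, \Sigma)$ and $N(-\mu, \Sigma)$ respectively.
The gradient of $Q(\omega)$ is

$$
\begin{aligned}
\frac{\partial Q(\omega)}{\partial \omega} &= \int (\delta - \omega'x)(-x)\phi_+(x)\,dx + \int (-\delta - \omega'x)(-x)\phi_-(x)\,dx \\
&= -\delta\mu + \Sigma\omega - \delta\mu + \Sigma\omega \\
&= 2\Sigma\omega - 2\delta\mu.
\end{aligned}
$$

Then $\omega_1 = \delta\Sigma^{-1}\mu$. □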

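Proof of Proposition 2

Let the loss be $L(u) = (1 - |u|)_+$. Consider $Q(\omega) = E_X[L(\omega'X)]$. Note that without loss of
generality, we assume that the separating hyperplane goes through the origin. Our goal is to
find

$$
\omega_2 = \operatorname*{argmin}_{\omega:\ \omega'\mu = 1} Q(\omega).
$$

Hence, the Lagrangian is $Q(\omega) - \alpha(\omega'\mu - 1)$.

We assume here $Y \in \{+\delta, -\delta\}$. Note $Q(\omega) = E[L(\omega'X)] = P(Y = +\delta)\,E_{X|+\delta}[(1 - |\omega'X|)_+] + P(Y = -\delta)\,E_{X|-\delta}[(1 - |\omega'X|)_+]$.

For $E_{X|+\delta}[(1 - |\omega'X|)_+]$, since we assume $X \sim N_p(\mu, \Sigma)$, $\omega'X \sim N(\omega'\mu, \omega'\Sigma\omega)$. Write
$U := \omega'X$ and $Z := (U - \omega'\mu)/\sqrt{\omega'\Sigma\omega}$. We have

$$
E_{X|+\delta}[(1 - |\omega'X|)_+] = E_U[(1 - |U|)_+] = \int_{-1}^{0} (1 + u) f_U(u)\,du + \int_{0}^{1} (1 - u) f_U(u)\,du.
$$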

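Here $\int_{-1}^{0} (1 + u) f_U(u)\,du$ can be written, by change of variable, as

$$
\int_{-1}^{0} (1 + u) f_U(u)\,du = \int_{\frac{-1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}}^{\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}} \left(1 + \sqrt{\omega'\Sigma\omega}\,z + \omega'\mu\right) f_Z(z)\,dz. \tag{4}
$$

Note that

$$
\frac{\partial}{\partial\omega}\sqrt{\omega'\Sigma\omega} = \frac{\Sigma\omega}{\sqrt{\omega'\Sigma\omega}}
\quad\text{and}\quad
\frac{\partial}{\partial\omega}\left(\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) = \frac{\omega'\Sigma\omega\,\mu - \Sigma\omega\,\omega'\mu}{(\omega'\Sigma\omega)^{3/2}}.
$$

The gradient of the right-hand side of (4) is

$$
\int_{\frac{-1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}}^{\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}} \left(\frac{\Sigma\omega}{\sqrt{\omega'\Sigma\omega}}\,z + \mu\right) f_Z(z)\,dz
+ 1 \cdot f_Z\!\left(\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) \cdot \left(-\frac{\omega'\Sigma\omega\,\mu - \Sigma\omega\,\omega'\mu}{(\omega'\Sigma\omega)^{3/2}}\right) + 0.
$$

Similarly, $\int_{0}^{1} (1 - u) f_U(u)\,du$ can be written, by change of variable, as

$$
\int_{\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}}^{\frac{1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}} \left(1 - \sqrt{\omega'\Sigma\omega}\,z - \omega'\mu\right) f_Z(z)\,dz,
$$

whose gradient is

$$
\int_{\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}}^{\frac{1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}} \left(-\frac{\Sigma\omega}{\sqrt{\omega'\Sigma\omega}}\,z - \mu\right) f_Z(z)\,dz
+ 0 - 1 \cdot f_Z\!\left(\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) \cdot \left(-\frac{\omega'\Sigma\omega\,\mu - \Sigma\omega\,\omega'\mu}{(\omega'\Sigma\omega)^{3/2}}\right).
$$

After taking a summation, we have that the gradient of $E_{X|+\delta}[(1 - |\omega'X|)_+]$ is

$$
\int_{\frac{-1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}}^{\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}} \left(\frac{\Sigma\omega}{\sqrt{\omega'\Sigma\omega}}\,z + \mu\right) f_Z(z)\,dz
+ \int_{\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}}^{\frac{1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}} \left(-\frac{\Sigma\omega}{\sqrt{\omega'\Sigma\omega}}\,z - \mu\right) f_Z(z)\,dz.
$$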

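The derivation for the gradient of $E_{X|-\delta}[(1 - |\omega'X|)_+]$ is similar, except that the mean
of $X$ given $Y = -\delta$ is assumed to be $-\mu$. Hence its gradient is

$$
\int_{\frac{-1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}}^{\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}} \left(\frac{\Sigma\omega}{\sqrt{\omega'\Sigma\omega}}\,z - \mu\right) f_Z(z)\,dz
+ \int_{\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}}^{\frac{1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}} \left(-\frac{\Sigma\omega}{\sqrt{\omega'\Sigma\omega}}\,z + \mu\right) f_Z(z)\,dz.
$$

The sum of the two gradients above, scaled by $P(Y = \delta) = P(Y = -\delta) = 1/2$, is

$$
\begin{aligned}
\frac{\partial Q(\omega)}{\partial\omega} = \frac{\Sigma\omega}{2\sqrt{\omega'\Sigma\omega}}\,A + \frac{\mu}{2}\Bigg\{
&\left[\Phi\!\left(\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) - \Phi\!\left(\frac{-1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)\right]
- \left[\Phi\!\left(\frac{1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) - \Phi\!\left(\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)\right] \\
&- \left[\Phi\!\left(\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) - \Phi\!\left(\frac{-1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)\right]
+ \left[\Phi\!\left(\frac{1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) - \Phi\!\left(\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)\right]
\Bigg\} \\
=: b\,\Sigma\omega + a\,\mu,
\end{aligned}
$$

where $\Phi$ denotes the distribution function of $Z$ and $A$ collects the integrals of $z f_Z(z)$ over the four intervals above.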


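Define $G(m, n) := \int_m^n z f_Z(z)\,dz$. Note that $G(m, n) = -G(-n, -m)$. The term $A$ above
can be found to be

$$
\begin{aligned}
A &= G\!\left(\frac{-1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)
- G\!\left(\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)
+ G\!\left(\frac{-1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)
- G\!\left(\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) \\
&= -G\!\left(\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)
- G\!\left(\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)
- G\!\left(\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)
- G\!\left(\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right) \\
&= -2\left[G\!\left(\frac{\omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{1 + \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)
+ G\!\left(\frac{-\omega'\mu}{\sqrt{\omega'\Sigma\omega}}, \frac{1 - \omega'\mu}{\sqrt{\omega'\Sigma\omega}}\right)\right].
\end{aligned}
$$

Setting the gradient of the Lagrangian to be zero, we have $a\mu + b\Sigma\omega = \alpha\mu$. Then

$$
\omega_2 \propto \Sigma^{-1}\mu.
$$

□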
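
The symmetry $G(m, n) = -G(-n, -m)$ used in the proof can be checked numerically. The sketch below assumes, as in the proof, that $f_Z$ is the standard normal density; it also uses the closed form $G(m, n) = f_Z(m) - f_Z(n)$, which follows from $f_Z'(z) = -z f_Z(z)$. The helper names are hypothetical.

```python
import math

def phi(z):
    """Standard normal density f_Z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def G_numeric(m, n, steps=20000):
    """Trapezoidal approximation of the integral of z * phi(z) over [m, n]."""
    h = (n - m) / steps
    total = 0.5 * (m * phi(m) + n * phi(n))
    for i in range(1, steps):
        z = m + i * h
        total += z * phi(z)
    return total * h

def G_closed(m, n):
    """Closed form phi(m) - phi(n), since the antiderivative of z * phi(z) is -phi(z)."""
    return phi(m) - phi(n)

m, n = -0.3, 1.2
print(G_numeric(m, n), G_closed(m, n), -G_closed(-n, -m))
```

The three printed values should agree up to the quadrature error, confirming the identity for this choice of limits.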


Proof of Theorem 1

Let $\delta = 1/(\mu'\Sigma^{-1}\mu)$. Then in Proposition 1, $\omega_1 = \delta\Sigma^{-1}\mu$; moreover, $\omega_1$ satisfies the
constraint $\omega'\mu = 1$ in Proposition 2. Combining the results of Proposition 1 and Proposition
2, we have $\omega_\infty = \delta\Sigma^{-1}\mu$ for the joint optimization problem. □
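
The constraint check used in this proof can be written out in one line. It is only a restatement of the argument, using the symmetry of $\Sigma^{-1}$ and the choice $\delta = 1/(\mu'\Sigma^{-1}\mu)$:

```latex
\[
\omega_1'\mu = (\delta\,\Sigma^{-1}\mu)'\mu = \delta\,\mu'\Sigma^{-1}\mu
= \frac{\mu'\Sigma^{-1}\mu}{\mu'\Sigma^{-1}\mu} = 1 .
\]
```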

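Proof of Theorem 2

Let $R(\omega) = E_{(X,Y)}(Y - \omega'X)^2 + C\,E_X(1 - |\omega'X|)_+$, $R_1(\omega) = E_{(X,Y)}(Y - \omega'X)^2$ and $R_2(\omega) =
E_X(1 - |\omega'X|)_+$, where $Y \in \{+\delta, -\delta\}$. For any $\omega$, we have

$$
\begin{aligned}
R_1(\omega) &= E_{(X,Y)}[(Y - \omega'X)^2] \\
&= E_{(X,Y)}[(Y - \omega'X)(Y - \omega'X)] \\
&= E(Y^2) - 2E_{(X,Y)}[(\omega'X)Y] + E[\omega'XX'\omega] \\
&= \delta^2 - 2\delta\omega'\mu + \omega'E[E(XX' \mid Y)]\omega \\
&= \delta^2 - 2\delta\omega'\mu + \omega'\Sigma\omega \\
&\ge \delta^2 - 2\delta\omega'\mu + \lambda_{\min}(\Sigma)\|\omega\|_2^2.
\end{aligned}
$$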

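Let $\omega^\lambda = \omega_\infty + \gamma^\lambda$. Recall that $\omega_\infty = \delta\Sigma^{-1}\mu$. By the definition of $\omega^\lambda$, we have

$$
\begin{aligned}
\gamma^\lambda &= \operatorname*{argmin}_{\gamma:\ (\omega_\infty + \gamma)'\mu = 1} R(\omega_\infty + \gamma) + \lambda\|\omega_\infty + \gamma\|_1 \\
&= \operatorname*{argmin}_{\gamma:\ \gamma'\mu = 0}\ g(\gamma) + \lambda\sum_{k \in K^c} |\gamma_k| + \lambda\sum_{k \in K} \left(|\omega_\infty^k + \gamma_k| - |\omega_\infty^k|\right) \\
&= \operatorname*{argmin}_{\gamma:\ \gamma'\mu = 0}\ f(\gamma),
\end{aligned}
$$

where $g(\gamma) = R(\omega_\infty + \gamma)$ and $f(\gamma)$ is defined as $g(\gamma) + \lambda\sum_{k \in K^c} |\gamma_k| + \lambda\sum_{k \in K} (|\omega_\infty^k + \gamma_k| -
|\omega_\infty^k|)$. We know that $f(\gamma^\lambda) - f(0) \le 0$ by the fact that $\gamma^\lambda$ minimizes $f$. Thus we have

$$
g(\gamma^\lambda) - g(0) \le \lambda\sum_{k \in K} \left(|\omega_\infty^k| - |\omega_\infty^k + \gamma_k^\lambda|\right) - \lambda\sum_{k \in K^c} |\gamma_k^\lambda|. \tag{5}
$$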

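Note that $g(\gamma^\lambda) - g(0) = [R_1(\omega_\infty + \gamma^\lambda) - R_1(\omega_\infty)] + C[R_2(\omega_\infty + \gamma^\lambda) - R_2(\omega_\infty)]$. For
the first term $R_1(\omega_\infty + \gamma^\lambda) - R_1(\omega_\infty)$, we observe that

$$
\begin{aligned}
& R_1(\omega_\infty + \gamma^\lambda) - R_1(\omega_\infty) \\
&= E_{(X,Y)}[(Y - \omega_\infty'X - (\gamma^\lambda)'X)^2] - E_{(X,Y)}[(Y - \omega_\infty'X)^2] \\
&= E_{(X,Y)}[(Y - (\gamma^\lambda)'X)^2 - 2\omega_\infty'X(Y - (\gamma^\lambda)'X) + (\omega_\infty'X)^2] - E_{(X,Y)}[(Y - \omega_\infty'X)^2] \\
&= R_1(\gamma^\lambda) - 2\delta\omega_\infty'\mu + 2\omega_\infty'\Sigma\gamma^\lambda + \omega_\infty'\Sigma\omega_\infty - E_{(X,Y)}[(Y - \omega_\infty'X)^2] \\
&= R_1(\gamma^\lambda) - 2\delta\omega_\infty'\mu + 2\omega_\infty'\Sigma\gamma^\lambda + \omega_\infty'\Sigma\omega_\infty - [\delta^2 - 2\delta\omega_\infty'\mu + \omega_\infty'\Sigma\omega_\infty] \\
&= R_1(\gamma^\lambda) + 2\omega_\infty'\Sigma\gamma^\lambda - \delta^2 \\
&= R_1(\gamma^\lambda) + 2\delta\mu'\Sigma^{-1}\Sigma\gamma^\lambda - \delta^2 \\
&= R_1(\gamma^\lambda) - \delta^2.
\end{aligned} \tag{6}
$$

Here the last statement is due to the constraint that $(\gamma^\lambda)'\mu = 0$.
For the second term $R_2(\omega_\infty + \gamma^\lambda) - R_2(\omega_\infty)$, we note that

$$
\begin{aligned}
& R_2(\omega_\infty + \gamma^\lambda) - R_2(\omega_\infty) \\
&= E_X[(1 - |\omega_\infty'X + (\gamma^\lambda)'X|)_+] - E_X[(1 - |\omega_\infty'X|)_+] \\
&\ge -E_X\left|(1 - |\omega_\infty'X + (\gamma^\lambda)'X|) - (1 - |\omega_\infty'X|)\right| \\
&= -E_X\left||\omega_\infty'X + (\gamma^\lambda)'X| - |\omega_\infty'X|\right| \\
&\ge -E_X\left|\omega_\infty'X + (\gamma^\lambda)'X - \omega_\infty'X\right| \\
&= -E_X\left|(\gamma^\lambda)'X\right| \\
&\ge -\sqrt{E_X[((\gamma^\lambda)'X)^2]} \\
&\ge -\sqrt{\lambda_{\max}(\Sigma)} \cdot \|\gamma^\lambda\|_2.
\end{aligned} \tag{7}
$$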


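Combining (6) and (7), we have

$$
R_1(\gamma^\lambda) - \delta^2 - C\sqrt{\lambda_{\max}(\Sigma)} \cdot \|\gamma^\lambda\|_2 \le g(\gamma^\lambda) - g(0).
$$

Due to this, along with (5), we have

$$
\begin{aligned}
R_1(\gamma^\lambda) &\le \lambda\sum_{k \in K} \left(|\omega_\infty^k| - |\omega_\infty^k + \gamma_k^\lambda|\right) - \lambda\sum_{k \in K^c} |\gamma_k^\lambda| + \delta^2 + C\sqrt{\lambda_{\max}(\Sigma)} \cdot \|\gamma^\lambda\|_2 \\
&\le \lambda\sum_{k \in K} |\gamma_k^\lambda| + \delta^2 + C\sqrt{\lambda_{\max}(\Sigma)} \cdot \|\gamma^\lambda\|_2 \\
&\le \lambda\sqrt{s}\,\|\gamma^\lambda\|_2 + \delta^2 + C\sqrt{\lambda_{\max}(\Sigma)} \cdot \|\gamma^\lambda\|_2.
\end{aligned}
$$

Recall that

$$
R_1(\gamma^\lambda) \ge \delta^2 - 2\delta(\gamma^\lambda)'\mu + \lambda_{\min}(\Sigma)\|\gamma^\lambda\|_2^2 = \delta^2 + \lambda_{\min}(\Sigma)\|\gamma^\lambda\|_2^2.
$$

Combining the lower and upper bounds of $R_1(\gamma^\lambda)$, we have

$$
\begin{aligned}
& \delta^2 + \lambda_{\min}(\Sigma)\|\gamma^\lambda\|_2^2 \le \lambda\sqrt{s}\,\|\gamma^\lambda\|_2 + \delta^2 + C\sqrt{\lambda_{\max}(\Sigma)} \cdot \|\gamma^\lambda\|_2 \\
\Rightarrow\ & \lambda_{\min}(\Sigma)\|\gamma^\lambda\|_2^2 \le \lambda\sqrt{s}\,\|\gamma^\lambda\|_2 + C\sqrt{\lambda_{\max}(\Sigma)} \cdot \|\gamma^\lambda\|_2.
\end{aligned}
$$

Thus

$$
\|\gamma^\lambda\|_2 \le \frac{\lambda\sqrt{s} + C\sqrt{\lambda_{\max}(\Sigma)}}{\lambda_{\min}(\Sigma)}.
$$

□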
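
The final bound is easy to evaluate for a given covariance matrix. The sketch below is illustrative: the helper `gamma_bound` and the input values are assumptions, with `lam`, `s` and `C` standing for $\lambda$, the sparsity level $s$ and the constant $C$ in the bound.

```python
import numpy as np

def gamma_bound(cov, lam, s, C):
    """Evaluate (lam * sqrt(s) + C * sqrt(lambda_max)) / lambda_min for a symmetric cov."""
    eig = np.linalg.eigvalsh(cov)            # eigenvalues in ascending order
    lam_min, lam_max = eig[0], eig[-1]
    return (lam * np.sqrt(s) + C * np.sqrt(lam_max)) / lam_min

# Hypothetical inputs: identity covariance, lambda = 0.1, s = 4 nonzero coefficients, C = 1.
bound = gamma_bound(np.eye(5), lam=0.1, s=4, C=1.0)
```

For the identity covariance both extreme eigenvalues equal one, so the bound reduces to $\lambda\sqrt{s} + C$, which is 1.2 for these inputs.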